Statistical issues in machine learning: towards reliable split selection and variable importance measures
نویسنده
چکیده
Recursive partitioning methods from machine learning are being widely applied in many scientific fields such as, e.g., genetics and bioinformatics. The present work is concerned with the two main problems that arise in recursive partitioning, instability and biased variable selection, from a statistical point of view. With respect to the first issue, instability, the entire scope of methods from standard classification trees over robustified classification trees and ensemble methods such as TWIX, bagging and random forests is covered in this work. While ensemble methods prove to be much more stable than single trees, they also loose most of their interpretability. Therefore an adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single tree if the partition is sufficiently stable. With respect to the second issue, variable selection bias, the statistical sources of this artifact in single trees and a new form of bias inherent in ensemble methods based on bootstrap samples are investigated. For single trees, one unbiased split selection criterion is evaluated and another one newly introduced here. Based on the results for single trees and further findings on the effects of bootstrap sampling on association measures, it is shown that, in addition to using an unbiased split selection criterion, subsampling instead of bootstrap sampling should be employed in ensemble methods to be able to reliably compare the variable importance scores of predictor variables of different types. The statistical properties and the null hypothesis of a test for the random forest variable importance are critically investigated. Finally, a new, conditional importance measure is suggested that allows for a fair comparison in the case of correlated predictor variables and better reflects the null hypothesis of interest.
منابع مشابه
Machine learning algorithms in air quality modeling
Modern studies in the field of environment science and engineering show that deterministic models struggle to capture the relationship between the concentration of atmospheric pollutants and their emission sources. The recent advances in statistical modeling based on machine learning approaches have emerged as solution to tackle these issues. It is a fact that, input variable type largely affec...
متن کاملFeature Selection Using Multi Objective Genetic Algorithm with Support Vector Machine
Different approaches have been proposed for feature selection to obtain suitable features subset among all features. These methods search feature space for feature subsets which satisfies some criteria or optimizes several objective functions. The objective functions are divided into two main groups: filter and wrapper methods. In filter methods, features subsets are selected due to some measu...
متن کاملUsing Feature Weights to Improve Performance of Neural Networks
Different features have different relevance to a particular learning problem. Some features are less relevant; while some very important. Instead of selecting the most relevant features using feature selection, an algorithm can be given this knowledge of feature importance based on expert opinion or prior learning. Learning can be faster and more accurate if learners take feature importance int...
متن کاملMachine learning based Visual Evoked Potential (VEP) Signals Recognition
Introduction: Visual evoked potentials contain certain diagnostic information which have proved to be of importance in the visual systems functional integrity. Due to substantial decrease of amplitude in extra macular stimulation in commonly used pattern VEPs, differentiating normal and abnormal signals can prove to be quite an obstacle. Due to developments of use of machine l...
متن کاملParty on ! A New
Random forests are one of the most popular statistical learning algorithms, and a variety of methods for fitting random forests and related recursive partitioning approaches is available in R. This paper points out two important features of the random forest implementation cforest available in the party package: The resulting forests are unbiased and thus preferable to the randomForest implemen...
متن کامل